03 09 2021

App description

The main goal of this course project is to build a simple word prediction application that saves the user some typing time. The study is based on the SwiftKey datasets. Because of computational limitations, we used only a random sample of about 20% of the data, bound together.

The app uses two simple algorithms: Stupid Back-Off (SBO) and correlation. The Stupid Back-Off algorithm predicts words based on n-gram frequencies. SBO performs well for frequently used but non-meaningful words such as articles, prepositions and pronouns.

The second part of the prediction algorithm uses only "meaningful" words to predict the next "meaningful" word, based on word correlation within a single text block of a few sentences. This part is our attempt to improve on the SBO algorithm by using all of the text, not just the last n-gram. The correlation algorithm has much lower accuracy than SBO, because the most common words are excluded. Predictions are associated with the previous text, so this algorithm not only predicts, but also helps the user find the necessary words.

SBO is the main word prediction method, and correlation is auxiliary. Words from these two predictions could simply be merged into a single list, or used as a starting point for a more sophisticated grammar-aware prediction algorithm, which is beyond the scope of this study.

Code for this project: https://github.com/OleksandrMyronov/Capstone_Project

App interface

Type some text into the input field, and the app predicts SBO and correlation word lists (if there are no words from the correlation database, just SBO). A visualization of the SBO prediction as a word graph is displayed on the right side, with sliders to change the number of word branches and the chain length.

SBO algorithm

The SBO algorithm from the sbo R package is based on n-gram frequencies. If an n-gram is not found in the list, SBO falls back to a lower-order n-gram list. The plot below shows SBO model efficiency, evaluated on a sample from the SwiftKey datasets. The SBO data table with 15 predictions is 102 kB for bigrams, 16.6 MB for trigrams and 100 MB for tetragrams, so for our purposes the most reasonable choice is trigrams with 10 predicted words, giving 28.5% accuracy.
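The back-off step can be sketched as follows. This is a minimal Python illustration of the lookup logic, not the sbo R package implementation; the tables and their contents are invented for the example:

```python
# Minimal Stupid Back-Off lookup sketch (illustrative only).
# Each table maps a context tuple to a ranked list of predicted words.
trigram_table = {("i", "love"): ["you", "it", "the"]}
bigram_table = {("love",): ["you", "it", "this"]}
unigram_table = {(): ["the", "to", "and"]}

def predict_sbo(context, k=3):
    """Return up to k predictions, backing off to lower-order n-grams."""
    words = tuple(w.lower() for w in context)
    for table, order in ((trigram_table, 2), (bigram_table, 1), (unigram_table, 0)):
        key = words[len(words) - order:] if order else ()
        if key in table:
            return table[key][:k]
    return []

print(predict_sbo(["I", "love"]))    # found in the trigram table
print(predict_sbo(["we", "love"]))   # backs off to the bigram table
print(predict_sbo(["zzz", "qqq"]))   # backs off to unigrams
```

The real model uses the same idea with much larger frequency-ranked tables, which is why the table size grows so quickly with n-gram order.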

## [1] "File size of models with 15 predicted words:"
##   word bigram trigram tetragram
## 1 16kb  102kb    16Mb    100 Mb

Correlation algorithm

The correlation algorithm is based on the frequency of word pairs that appear together (not necessarily side by side) within a single text block. The algorithm uses only words from a dictionary covering 85% of the corpora (this value could be increased, but memory usage would also increase dramatically); non-meaningful words from the stop_words list are excluded.
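The pairwise correlation of word occurrence across text blocks can be computed as a phi coefficient, as in the widyr R package's pairwise_cor. A small Python sketch with a toy corpus (the blocks and words are invented for illustration; only pairs that actually co-occur are computed, so the negative-correlation filtering happens afterwards):

```python
from itertools import combinations
from math import sqrt

# Toy corpus: each set stands for one "text block of a few sentences",
# already reduced to dictionary words with stop words removed.
blocks = [
    {"love", "heart", "letter"},
    {"love", "heart", "music"},
    {"government", "tax", "letter"},
    {"love", "music"},
]

def pairwise_phi(blocks):
    """Phi correlation of word co-occurrence across blocks."""
    n = len(blocks)
    count = {}       # blocks containing each word
    pair_count = {}  # blocks containing both words of a pair
    for block in blocks:
        for w in block:
            count[w] = count.get(w, 0) + 1
        for a, b in combinations(sorted(block), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    cors = {}
    for (a, b), n11 in pair_count.items():
        na, nb = count[a], count[b]
        denom = sqrt(na * (n - na) * nb * (n - nb))
        if denom > 0:
            cors[(a, b)] = (n * n11 - na * nb) / denom
    return cors

cors = pairwise_phi(blocks)
```

In the app, the resulting pairs would then be filtered to keep only correlations above 0.01 before being saved as the model file.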

To reduce memory usage, we keep only positive correlations higher than 0.01. This also reduces the number of possible predictions, but for 10 predicted words it is mostly sufficient. The algorithm also produces negative coefficients (such as the "love"-"government" pair with a correlation of -0.0096). The correlation range is (-0.015, 0.491), so negative values would not have much impact, and we are not interested in words that probably would not appear anyway.

In the prediction algorithm, we filter the correlation database for all word pairs whose first word appears in the text. We then group them by the second (predicted) word and sum the correlation coefficients; the words with the highest sums are our predictions. This algorithm can efficiently find words associated with the whole text block, not only the last n-gram. The algorithm in this small app simply sums all the correlation terms; for large text blocks a "fading memory" penalty coefficient should be used, so that words far from the current position contribute less.
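The grouping-and-summing step can be sketched like this (the database values are invented for illustration; the `fade` parameter is a hypothetical knob showing the "fading memory" idea, not part of the app):

```python
# cor_db maps (word_in_text, predicted_word) -> correlation coefficient.
# The pairs and values below are invented for the example.
cor_db = {
    ("love", "heart"): 0.42,
    ("love", "music"): 0.15,
    ("letter", "heart"): 0.08,
    ("letter", "stamp"): 0.30,
}

def predict_by_correlation(text_words, cor_db, k=3, fade=1.0):
    """Sum correlations per candidate word; fade < 1 down-weights distant words."""
    scores = {}
    n = len(text_words)
    for i, w in enumerate(text_words):
        weight = fade ** (n - 1 - i)  # the most recent word gets weight 1
        for (first, pred), cor in cor_db.items():
            if first == w and pred not in text_words:
                scores[pred] = scores.get(pred, 0.0) + weight * cor
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(predict_by_correlation(["love", "letter"], cor_db))
```

With `fade=1.0` this reduces to the plain sum used in the app; here "heart" wins because it is supported by two words of the input.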

The correlation algorithm has low accuracy compared with SBO: a single predicted correlation word adds about 0.4% accuracy, and ten words add about 2%. However, this is more memory-efficient than simply increasing the number of n-gram predictions. The correlation model file for this app is 8.3 MB (dictionary covering 85% of the corpora, correlations filtered at > 0.01).